1 Introduction

The goal of this project is to provide an insight of what are the main features that affect the filtering rate at port. Additionally, we aim to construct a predictive model that will be able to give an accurate result for the filtering rate beforehand.

2 Dependencies

2.3 Load data to R

## [1] 582  20
##     hum_pct         tasa_filt        x10_pct          insol       
##  Min.   : 7.950   Min.   :410.0   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 8.460   1st Qu.:534.5   1st Qu.:21.72   1st Qu.: 6.550  
##  Median : 8.670   Median :567.7   Median :23.01   Median : 7.590  
##  Mean   : 8.703   Mean   :570.2   Mean   :22.98   Mean   : 7.733  
##  3rd Qu.: 8.850   3rd Qu.:604.0   3rd Qu.:24.20   3rd Qu.: 8.850  
##  Max.   :10.080   Max.   :717.4   Max.   :28.63   Max.   :13.000  
##  NA's   :2        NA's   :2       NA's   :1       NA's   :1       
##      Na_gpl       sulfato_mg_l   cloruro_mg_l      conduct    
##  Min.   :157.0   Min.   : 459   Min.   : 57.0   Min.   : 901  
##  1st Qu.:487.5   1st Qu.:1534   1st Qu.:129.0   1st Qu.:2945  
##  Median :580.0   Median :1701   Median :154.0   Median :3490  
##  Mean   :565.4   Mean   :1714   Mean   :159.6   Mean   :3341  
##  3rd Qu.:650.5   3rd Qu.:1880   3rd Qu.:187.0   3rd Qu.:3870  
##  Max.   :863.0   Max.   :2709   Max.   :331.0   Max.   :4830  
##  NA's   :3       NA's   :3      NA's   :3       NA's   :3     
##        ph             cot            ca_ppm        tft_ca_pct    
##  Min.   : 6.25   Min.   : 0.00   Min.   : 24.0   Min.   :0.1000  
##  1st Qu.: 9.50   1st Qu.:10.00   1st Qu.:206.0   1st Qu.:0.2000  
##  Median :10.01   Median :14.50   Median :272.0   Median :0.2600  
##  Mean   :10.01   Mean   :15.52   Mean   :287.8   Mean   :0.2631  
##  3rd Qu.:10.65   3rd Qu.:19.50   3rd Qu.:338.0   3rd Qu.:0.3000  
##  Max.   :12.03   Max.   :62.00   Max.   :720.0   Max.   :0.7800  
##  NA's   :3       NA's   :34      NA's   :32      NA's   :161     
##     Cu_S_gpl         tft_visc         tft_floc      sol_filt_pct  
##  Min.   :0.0200   Min.   : 2.585   Min.   :0.581   Min.   :57.20  
##  1st Qu.:0.1100   1st Qu.: 5.440   1st Qu.:1.358   1st Qu.:61.55  
##  Median :0.1600   Median : 6.240   Median :1.741   Median :61.99  
##  Mean   :0.1972   Mean   : 6.329   Mean   :1.951   Mean   :62.15  
##  3rd Qu.:0.2400   3rd Qu.: 7.040   3rd Qu.:2.260   3rd Qu.:62.79  
##  Max.   :1.1700   Max.   :12.185   Max.   :8.678   Max.   :65.03  
##  NA's   :3        NA's   :1        NA's   :54      NA's   :4      
##  sol_recep_pct        temp         esf_corte         na_count     
##  Min.   :53.67   Min.   :17.90   Min.   : 2.508   Min.   : 0.000  
##  1st Qu.:61.62   1st Qu.:22.30   1st Qu.: 5.643   1st Qu.: 0.000  
##  Median :62.07   Median :25.20   Median : 7.059   Median : 1.000  
##  Mean   :61.97   Mean   :25.41   Mean   : 7.365   Mean   : 1.179  
##  3rd Qu.:62.42   3rd Qu.:28.60   3rd Qu.: 8.855   3rd Qu.: 2.000  
##  Max.   :64.34   Max.   :31.70   Max.   :15.763   Max.   :16.000  
##  NA's   :2       NA's   :61      NA's   :313

3 Data engineering

3.1 Check variable data type

## There are 0 remaining columns with character values

Since categorical variables enter into statistical models differently than continuous variables, storing data as factors insures that the modeling functions will treat such data correctly. The code performs the following tasks: rename variable names, change data type to factor and order ordinal factors.

4 Descriptive statistics

Descriptive statistics describe quantitatively the basic features of the data. These statistics will give us a head start by providing information about stuff like skewness, outliers (range) missing data points and (near) zero variance.

4.1 Data profile

5 Missing values

5.1 Frequency

With this plot we can see that the variables that present the highest amount of missing data are: - esfuerzo de corte - tft_ca_ppt - temp - tft_floc - cot - ca_ppm

5.2 Missing data representated geometrically

The variables mentiones above are represented graphycally to search for any pattern. In this graph, the data which is missing, is given a value of 10% less than the minimum value of the available data. Then, it’s represented in a dispersion graph were the red color represents the missing data.

Fig. 1

Fig. 1

There is no insight in regards to any pattern correlated to the filter rate.

5.3 More information about missing data and combinations.

In this part, we aim to answer the question: in which variables observations are missing, and how many? Aggregation plots are a useful tool for answering these questions. The one-liner below is all you need.

## 
##  Variables sorted by number of missings: 
##       Variable       Count
##      esf_corte 0.537800687
##     tft_ca_pct 0.276632302
##           temp 0.104810997
##       tft_floc 0.092783505
##            cot 0.058419244
##         ca_ppm 0.054982818
##   sol_filt_pct 0.006872852
##         Na_gpl 0.005154639
##   sulfato_mg_l 0.005154639
##   cloruro_mg_l 0.005154639
##        conduct 0.005154639
##             ph 0.005154639
##       Cu_S_gpl 0.005154639
##        hum_pct 0.003436426
##      tasa_filt 0.003436426
##  sol_recep_pct 0.003436426
##        x10_pct 0.001718213
##          insol 0.001718213
##       tft_visc 0.001718213
##       na_count 0.000000000

It’s difficult to give any insight in regards to missing data.

5.4 Imputation

Finally, we decide to imputate the data. The technique used is Predictive Mean Matching (PAM) algorithm. Two data frames are generated, one with imputated data df, and the other one with only the rows that have no missing data df_port.

6 Correlations overview

After engineering new features and before starting the modelling, we will visualise the relations between our parameters using a correlation matrix. For this, we need to change all the input features into a numerical format. The visualisation uses the corrplot function from the eponymous package. Corrplot gives us great flexibility in manipulating the style of our plot.

What we see below, are the colour-coded correlation coefficients for each combination of two features. In simplest terms: this shows whether two features are connected so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero the weaker is the correlation. Both 1 and -1 are the ideal cases of perfect correlation and anti-correlation (dark blue and dark red in the plots below).

Here, we are of course interested if and how strongly our features correlate with the tasaDeFiltrado, the prediction of which is the ultimate goal of this challenge. But we also want to know whether our potential predictors are correlated among each other, so that we can reduce the collinearity in our data set and improve the robustness of our prediction:

6.1 Correlation matrix with imputated data.

We find:

  • Na_gpl is correlated to ph, which is expected since sodium is a base and as concentration increase the pH must increase as well.
  • As the amount of calcium which is given by ca_ppm and tft_ca_pct increases the ph is lowered. This is a similar behaviour when Acid runoff depletes the water’s alkalinity.
  • The amount of missing data na_count decreases when the amount of Na_gpl increases. Somehow, with more sodium it’s more likely that more data is present. Inversly, when sodium levels decreases there is more missing data. Whe have no reponse as to which is the cause and which is the consequence of this behaviour. The analysis shown above is also applicable, but to a lesser extenct to conductivity. -The amount of missing data na_count increases as the amount of ca_ppm and tft_ca_pct increases.

6.2 Correlation matrix with dropped missing data.

Analyzing only the sampled data shows news observations, wich shouldn’t be expected. For instance, Na_gpl appears to be highly correlated to the filtering rate whilst this was not observed in the imputated data set.

This is a tricky situation since we can’t give a good feedback on what’s going on. Why does the operator take some samples whilst in other he decides to omit? We ought to talk with the people in charge of the sample and do further analysis before we can giver an

7 Data preparation

8 Machine learning model

8.2 Correlation

##            xgbTree    glmnet  glmnet.1
## xgbTree  1.0000000 0.8155238 0.7828588
## glmnet   0.8155238 1.0000000 0.9052348
## glmnet.1 0.7828588 0.9052348 1.0000000

8.3 Performance summary

## $RMSE
##                Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## xgbTree  0.06330538 0.06548392 0.06717139 0.06817997 0.06920201 0.07741116
## glmnet   0.07242736 0.07840845 0.07941515 0.07962689 0.08113677 0.08645527
## glmnet.1 0.06897974 0.07354095 0.07741870 0.07866448 0.08477519 0.08762063
##          NA's
## xgbTree     0
## glmnet      0
## glmnet.1    0
## 
## $Rsquared
##               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## xgbTree  0.3809804 0.4155011 0.4727092 0.4511680 0.4836834 0.5061176    0
## glmnet   0.1665313 0.2340235 0.2740467 0.2715755 0.3244529 0.3434975    0
## glmnet.1 0.1229701 0.2110392 0.3034446 0.2960307 0.3969368 0.4298480    0

8.4 Summary

## The following models were ensembled: xgbTree, glmnet, glmnet.1 
## They were weighted: 
## -0.3852 0.8189 0.0774 0.1639
## The resulting RMSE is: 0.0681
## The fit for each individual model on the RMSE is: 
##    method       RMSE      RMSESD
##   xgbTree 0.06817997 0.004673665
##    glmnet 0.07962689 0.004252056
##  glmnet.1 0.07866448 0.007164357

8.5 Q-Q plot residuals

9 Feature importance

9.1 Ridge regression

9.3 Plot

## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.

##  [1] "temp"          "sol_filt_pct"  "conduct"       "sol_recep_pct"
##  [5] "Na_gpl"        "sulfato_mg_l"  "tft_visc"      "cot"          
##  [9] "x10_pct"       "hum_pct"       "ca_ppm"        "tft_floc"     
## [13] "insol"         "cloruro_mg_l"  "esf_corte"     "tft_ca_pct"   
## [17] "ph"            "Cu_S_gpl"
## # A tibble: 18 x 2
##    Feature       Importance
##    <fct>              <dbl>
##  1 temp             0.0290 
##  2 sol_filt_pct     0.0157 
##  3 conduct          0.0145 
##  4 sol_recep_pct    0.0126 
##  5 Na_gpl           0.0103 
##  6 sulfato_mg_l     0.00863
##  7 tft_visc         0.00834
##  8 cot              0.00773
##  9 x10_pct          0.00703
## 10 hum_pct          0.00588
## 11 ca_ppm           0.00578
## 12 tft_floc         0.00515
## 13 insol            0.00509
## 14 cloruro_mg_l     0.00419
## 15 esf_corte        0.00339
## 16 tft_ca_pct       0.00273
## 17 ph               0.00246
## 18 Cu_S_gpl         0.00241

##          temp  sol_filt_pct       conduct sol_recep_pct        Na_gpl 
##   0.028976397   0.015692323   0.014545557   0.012552678   0.010283734 
##  sulfato_mg_l      tft_visc           cot       x10_pct       hum_pct 
##   0.008626686   0.008341392   0.007728433   0.007027991   0.005878100 
##        ca_ppm      tft_floc         insol  cloruro_mg_l     esf_corte 
##   0.005777914   0.005153675   0.005088682   0.004191773   0.003394108 
##    tft_ca_pct            ph      Cu_S_gpl 
##   0.002729461   0.002461570   0.002412079

9.4 Summary plot

The following plot shows the most important features, and their impact on the filtering rate.

9.5 Force plot.

A SHAP force plot shows the contribution of the most important features divided by quantity.

## The SHAP values of the Rest 14 features were summed into variable 'rest_variables'.
## Data has N = 466 | zoom in length is 50 at location 250.

9.6 Importance plot.

The distribution of the real values agaisnt the contribution that they have on the filtering rate.